

Search for: All records

Creators/Authors contains: "Patel, Pratyush"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site's.

  1. Recent innovations in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations, and identify the differences between training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to build a reliable and robust power management framework. We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.
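
To make the oversubscription argument concrete, here is a minimal sketch of the arithmetic behind provisioning against average rather than peak draw, with a hypothetical cap trigger. All numbers and names (PEAK_KW, AVG_KW, enforce_cap) are illustrative assumptions, not values or interfaces from the paper.

```python
# Hypothetical sketch of power-oversubscription arithmetic in the spirit of
# POLCA; all power figures below are assumed for illustration only.
def servers_admitted(budget_kw: float, per_server_kw: float) -> int:
    """How many servers fit under a fixed datacenter power budget."""
    return int(budget_kw // per_server_kw)

PEAK_KW = 1.0   # assumed peak draw of one inference server
AVG_KW = 0.7    # assumed sustained draw during inference

baseline = servers_admitted(1000, PEAK_KW)       # provision for worst case
oversub = servers_admitted(1000, AVG_KW * 1.1)   # provision for average + margin

print(f"peak-provisioned servers: {baseline}")
print(f"oversubscribed servers:   {oversub}")
print(f"extra capacity:           {oversub / baseline - 1:.0%}")

def enforce_cap(draw_kw: float, budget_kw: float, cap_ratio: float = 0.95) -> str:
    """Signal a GPU power cap when aggregate draw approaches the budget."""
    return "apply GPU power cap" if draw_kw > cap_ratio * budget_kw else "ok"
```

Under these assumed numbers the oversubscribed cluster admits about 30% more servers, consistent with the abstract's headline result; a real controller would rely on the rarity of correlated peaks to keep the cap path infrequent.
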
  2. Climate change is a pressing threat to planetary well-being that can be addressed only by rapid near-term actions across all sectors. Yet, the cloud computing sector, with its increasingly large carbon footprint, has initiated only modest efforts to reduce emissions to date; its main approach today relies on cloud providers sourcing renewable energy from a limited global pool of options. We investigate how to accelerate cloud computing's efforts. Our approach tackles carbon reduction from a software standpoint by gradually integrating carbon awareness into the cloud abstraction. Specifically, we identify key bottlenecks to software-driven cloud carbon reduction, including (1) the lack of visibility and disaggregated control between cloud providers and users over infrastructure and applications, (2) the immense overhead presently incurred by application developers to implement carbon-aware application optimizations, and (3) the increasing complexity of carbon-aware resource management due to renewable energy variability and growing hardware heterogeneity. To overcome these barriers, we propose an agile approach that federates the responsibility and tools to achieve carbon awareness across different cloud stakeholders. As a key first step, we advocate leveraging the role of application operators in managing large-scale cloud deployments and integrating carbon efficiency metrics into their cloud usage workflow. We discuss various techniques to help operators reduce carbon emissions, such as carbon budgets, service-level visibility into emissions, and configurable-yet-centralized resource management optimizations. 
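
As a hedged illustration of one technique mentioned above, the sketch below shows what a per-deployment carbon budget check might look like for an application operator. The CarbonBudget class, its fields, and the numeric values are all hypothetical, invented here to make the idea concrete; they are not an interface from the paper.

```python
# Illustrative carbon-budget gate an application operator might run before
# launching a batch job; all names and numbers are assumptions.
from dataclasses import dataclass

@dataclass
class CarbonBudget:
    limit_gco2: float        # emissions allowance for this deployment
    spent_gco2: float = 0.0  # emissions charged so far

    def can_run(self, energy_kwh: float, intensity_gco2_per_kwh: float) -> bool:
        """Admit the job only if its estimated emissions fit the budget."""
        return self.spent_gco2 + energy_kwh * intensity_gco2_per_kwh <= self.limit_gco2

    def charge(self, energy_kwh: float, intensity_gco2_per_kwh: float) -> None:
        self.spent_gco2 += energy_kwh * intensity_gco2_per_kwh

budget = CarbonBudget(limit_gco2=5_000)
grid_intensity = 380.0  # gCO2/kWh; in practice this would come from a grid-data feed
job_energy = 12.0       # kWh, estimated from past runs

if budget.can_run(job_energy, grid_intensity):
    budget.charge(job_energy, grid_intensity)
    print("run now")
else:
    print("defer to a lower-carbon window")
```
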
  3. As modern server GPUs become increasingly power-intensive, better power management mechanisms can significantly reduce power consumption, capital costs, and carbon emissions in large cloud datacenters. This letter uses diverse datacenter workloads to study the power management capabilities of modern GPUs. We find that current GPU power management mechanisms have limited compatibility and monitoring support under cloud virtualization, and that their implementations of Dynamic Voltage and Frequency Scaling (DVFS) and power capping are sub-optimal, imprecise, and non-intuitive. Consequently, efficient GPU power management is not widely deployed in clouds today. To address these issues, we make actionable recommendations for GPU vendors and researchers.
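
The interfaces the letter evaluates are exposed through NVML. The sketch below probes two of them, power telemetry and power capping, via the pynvml bindings; the 0.8 cap ratio is an arbitrary example, setting a limit requires administrator rights, and on many virtualized cloud GPUs these calls simply fail, which mirrors the letter's finding.

```python
# Minimal probe of NVML power telemetry and power capping via pynvml.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
    print(f"current draw: {draw_w:.1f} W")
except pynvml.NVMLError as err:
    print(f"power telemetry unavailable: {err}")

try:
    lo_mw, hi_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    # Example power cap: 20% below the maximum limit (needs admin privileges).
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(hi_mw * 0.8))
    print(f"cap set to {hi_mw * 0.8 / 1000.0:.0f} W")
except pynvml.NVMLError as err:
    print(f"power capping unsupported here: {err}")

pynvml.nvmlShutdown()
```
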
  4. The performance of superconducting qubits is degraded by a poorly characterized set of energy sources breaking the Cooper pairs responsible for superconductivity, creating a condition often called “quasiparticle poisoning”. Both superconducting qubits and low-threshold dark matter calorimeters have observed excess bursts of quasiparticles or phonons that decrease in rate with time. Here, we show that a silicon crystal glued to its holder exhibits a rate of low-energy phonon events more than two orders of magnitude larger than in a functionally identical crystal suspended from its holder in a low-stress state. The excess phonon event rate in the glued crystal decreases with time since cooldown, consistent with a source of phonon bursts that contributes to quasiparticle poisoning in quantum circuits and the low-energy events observed in cryogenic calorimeters. We argue that relaxation of thermally induced stress between the glue and the crystal is the source of these events.
    more » « less
  5. [No abstract available for this record.]
  6. Smartwatches have the potential to provide glanceable, always-available sound feedback to people who are deaf or hard of hearing (DHH). We present SoundWatch, a smartwatch-based deep learning application to sense, classify, and provide feedback about sounds occurring in the environment. To design SoundWatch, we first examined four low-resource sound classification models across four device architectures: watch-only, watch+phone, watch+phone+cloud, and watch+cloud. We found that the best model, VGG-lite, performed similarly to the state of the art for nonportable devices while requiring substantially less memory (roughly one-third), and that the watch+phone architecture provided the best balance among CPU, memory, network usage, and latency. Based on these results, we built our smartwatch app and conducted a lab evaluation with eight DHH participants. We found support for our sound classification app but also uncovered concerns with misclassifications, latency, and privacy.
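
To illustrate the kind of low-resource classifier such a pipeline needs, here is a hedged sketch of a compact CNN over log-mel spectrogram windows. This is a generic small network written for this page, not the authors' VGG-lite; the input shape, class count, and layer sizes are all assumptions.

```python
# Hedged sketch of a compact on-device sound classifier (not the paper's
# VGG-lite model); input shape and class count are illustrative assumptions.
import torch
import torch.nn as nn

class TinySoundNet(nn.Module):
    """Classifies a 1x64x96 log-mel spectrogram window into N sound classes."""
    def __init__(self, num_classes: int = 20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling keeps the model tiny
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = TinySoundNet()
logits = model(torch.randn(1, 1, 64, 96))  # one spectrogram window
print(logits.argmax(dim=1))                # predicted sound class index
```

A watch+phone deployment would run a model like this on the phone and ship only the predicted label back to the watch, which is the latency/memory balance the study found best.
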